DATASET
CrimeRate - Per capita crime rate by city
Indus - Proportion of non-retail business acres per city
River - Riverfront or not (1 = bounds river; 0 = otherwise)
AvgRoom - Average number of rooms per dwelling per city
Age - Average age of a house by city
Tax - Full-value property-tax rate per $10,000
PTRatio - Pupil-teacher ratio by city
LStat - % lower status of the population by city
MedPrice - Median price of owner-occupied homes in $1000s
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(leaps)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(nortest)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
cincinnati <- read_csv("CINCINNATI.csv")
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cincinnati)
## # A tibble: 6 × 9
## CrimeRate Indus River AvgRoom Age Tax PTRatio LStat MedPrice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0273 7.07 0 7.18 61 242 17.8 4.03 34.7
## 2 0.254 9.9 0 5.70 78 304 18.4 11.5 16.2
## 3 0.0396 5.19 0 6.04 35 224 20.2 8.01 21.1
## 4 0.145 10.0 0 5.73 65 432 17.8 13.6 19.3
## 5 0.786 3.97 0 7.01 85 264 13 14.8 30.7
## 6 15.3 18.1 0 6.65 93 666 20.2 23.2 13.9
#Check for missing values
any(is.na(cincinnati))
## [1] FALSE
#Count missing values per column
colSums(is.na(cincinnati))
## CrimeRate Indus River AvgRoom Age Tax PTRatio LStat
## 0 0 0 0 0 0 0 0
## MedPrice
## 0
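The checks above cover missing values; an actual duplicate-row check is sketched below on a small synthetic data frame (an assumption for illustration; the same `duplicated()` calls apply to the cincinnati tibble).

```r
# Sketch: checking for fully duplicated rows on a toy data frame
df <- data.frame(a = c(1, 2, 2, 3), b = c("x", "y", "y", "z"))
sum(duplicated(df))                # number of repeated rows
df_unique <- df[!duplicated(df), ] # drop the repeats, keep first occurrences
nrow(df_unique)
```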
#Check for correct format
str(cincinnati$CrimeRate)
## num [1:150] 0.0273 0.2536 0.0396 0.1448 0.7857 ...
str(cincinnati$Indus)
## num [1:150] 7.07 9.9 5.19 10.01 3.97 ...
str(cincinnati$River)
## num [1:150] 0 0 0 0 0 0 0 0 0 0 ...
str(cincinnati$AvgRoom)
## num [1:150] 7.18 5.71 6.04 5.73 7.01 ...
str(cincinnati$Age)
## num [1:150] 61 78 35 65 85 93 32 88 46 97 ...
str(cincinnati$Tax)
## num [1:150] 242 304 224 432 264 666 300 666 398 403 ...
str(cincinnati$PTRatio)
## num [1:150] 17.8 18.4 20.2 17.8 13 20.2 15.3 20.2 18.7 14.7 ...
str(cincinnati$LStat)
## num [1:150] 4.03 11.5 8.01 13.61 14.79 ...
str(cincinnati$MedPrice)
## num [1:150] 34.7 16.2 21.1 19.3 30.7 13.9 22 14.3 20.8 41.3 ...
#Check for inconsistencies/invalid values
any(cincinnati$CrimeRate < 0)
## [1] FALSE
any(cincinnati$Indus <= 0)
## [1] FALSE
any(cincinnati$River < 0 | cincinnati$River > 1)
## [1] FALSE
any(cincinnati$AvgRoom < 0)
## [1] FALSE
any(cincinnati$Age < 0)
## [1] FALSE
any(cincinnati$Age <= 0)
## [1] FALSE
any(cincinnati$Tax <= 0)
## [1] FALSE
any(cincinnati$PTRatio <= 0)
## [1] FALSE
any(cincinnati$LStat < 0 | cincinnati$LStat > 100)
## [1] FALSE
any(cincinnati$MedPrice <= 0)
## [1] FALSE
#Look for outliers, determine how to handle them
summary(cincinnati$CrimeRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01096 0.07250 0.19160 3.34698 2.69719 88.97620
boxplot(cincinnati$CrimeRate, main = "Boxplot of Crime Rate")
hist(cincinnati$CrimeRate, breaks = 20, col = "lightcoral", main = "Histogram of Crime Rate")
#Conclusion: There are two high CrimeRate outliers, but they are plausible values, and without knowing which city each rate belongs to, there is no way to investigate whether removing them is justified
summary(cincinnati$Indus)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.740 4.935 8.560 10.475 18.100 25.650
boxplot(cincinnati$Indus, main = "Boxplot of Proportion of Industry")
hist(cincinnati$Indus, breaks = 20, col = "lightcoral", main = "Histogram of Proportion of Industry")
#Conclusion: No outliers
summary(cincinnati$AvgRoom)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.880 5.962 6.321 6.341 6.684 8.704
boxplot(cincinnati$AvgRoom, main = "Boxplot of Average Number of Rooms")
hist(cincinnati$AvgRoom, breaks = 20, col = "lightcoral", main = "Histogram of Average Number of Rooms")
#Conclusion: No outliers
summary(cincinnati$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 41.00 69.50 65.07 91.75 100.00
boxplot(cincinnati$Age, main = "Boxplot of House Age")
hist(cincinnati$Age, breaks = 20, col = "lightcoral", main = "Histogram of House Age")
#Conclusion: No outliers
summary(cincinnati$Tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 188.0 284.0 330.0 399.2 437.0 666.0
boxplot(cincinnati$Tax, main = "Boxplot of Tax Rate")
hist(cincinnati$Tax, breaks = 20, col = "lightcoral", main = "Histogram of Tax Rate")
outlier <- subset(cincinnati, Tax == 666)
outlier
## # A tibble: 36 × 9
## CrimeRate Indus River AvgRoom Age Tax PTRatio LStat MedPrice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15.3 18.1 0 6.65 93 666 20.2 23.2 13.9
## 2 5.58 18.1 0 6.44 88 666 20.2 16.2 14.3
## 3 3.68 18.1 0 5.36 96 666 20.2 10.2 20.8
## 4 16.8 18.1 0 5.28 98 666 20.2 30.8 7.2
## 5 3.85 18.1 1 6.40 91 666 20.2 13.3 21.7
## 6 14.3 18.1 0 6.23 88 666 20.2 13.1 21.4
## 7 5.82 18.1 0 6.51 90 666 20.2 10.3 20.2
## 8 9.72 18.1 0 6.41 97 666 20.2 19.5 17.1
## 9 10.1 18.1 0 6.83 94 666 20.2 19.7 14.1
## 10 19.6 18.1 0 7.31 98 666 20.2 13.4 15
## # ℹ 26 more rows
#Conclusion: Initially it looks like there is an outlier at 666, but upon further examination there are 36 records with this value, so it is not an isolated outlier. We need to determine why this value repeats.
summary(cincinnati$PTRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 16.65 18.65 18.28 20.20 21.20
boxplot(cincinnati$PTRatio, main = "Boxplot of Pupil-teacher ratio")
hist(cincinnati$PTRatio, breaks = 20, col = "lightcoral", main = "Histogram of Pupil-teacher ratio")
#Conclusion: One entry is a bit low, but not an outlier
summary(cincinnati$LStat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.470 6.992 10.475 11.850 15.027 34.410
boxplot(cincinnati$LStat, main = "Boxplot of Lower Status Proportion")
hist(cincinnati$LStat, breaks = 20, col = "lightcoral", main = "Histogram of Lower Status Proportion")
#Conclusion: No outliers
summary(cincinnati$MedPrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.60 17.73 21.45 22.62 24.77 50.00
boxplot(cincinnati$MedPrice, main = "Boxplot of Median House Price")
hist(cincinnati$MedPrice, breaks = 20, col = "lightcoral", main = "Histogram of Median House Price")
#Conclusion: No outliers
#Analyze relationship between variables
ggpairs(cincinnati)
ggcorr(cincinnati, label = TRUE, label_round = 2, palette = "RdYlBu")
#Biggest impact on Median Home Price:
#-AvgRooms 0.747 (More rooms = more $)
#-LStat -0.751 (Lower status = less $)
#-PTRatio -0.575 (higher ratio = fewer teachers per student = less $)
#-Indus -0.574 (more non-retail business = less appealing to live in = less $)
#-Tax -0.540 (higher tax rate = lower $)
#-Age -0.482 (the older the house = less $)
#Possible Variable interdependency
#-Tax/CrimeRate 0.538
#-Tax/Indus 0.698
#-Age/Indus 0.671
#-LStat/Indus 0.645
#-LStat/AvgRoom -0.615
#-Tax/Age 0.508
#-Age/LStat 0.654
#-LStat/Tax 0.565
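A correlation ranking like the one above can be computed in one pass; a minimal sketch using `mtcars` as a stand-in dataset (an assumption, since CINCINNATI.csv is not reproduced here), with `mpg` playing the role of MedPrice:

```r
# Sketch: rank all predictors by their Pearson correlation with the response
predictors <- setdiff(names(mtcars), "mpg")
cors <- sapply(predictors, function(v) cor(mtcars[[v]], mtcars$mpg))
sort(round(cors, 2))   # most negative first, most positive last
```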
#Check linear regression assumption of linearity
ggplot(cincinnati, aes(x = CrimeRate, y = MedPrice)) +
geom_point() +
labs(title = "Crimerate vs Price",
x = "Crimerate",
y = "Price") +
theme_minimal()
#Result: Negative non-linear
ggplot(cincinnati, aes(x = River, y = MedPrice)) +
geom_point() +
labs(title = "River vs Price",
x = "River",
y = "Price") +
theme_minimal()
#Result: Small negative correlation for riverfront homes
ggplot(cincinnati, aes(x = AvgRoom, y = MedPrice)) +
geom_point() +
labs(title = "AvgRoom vs Price",
x = "AvgRoom",
y = "Price") +
theme_minimal()
#Result: Clear Positive Linear
ggplot(cincinnati, aes(x = Age, y = MedPrice)) +
geom_point() +
labs(title = "Age vs Price",
x = "Age",
y = "Price") +
theme_minimal()
#Result: Negative Non-linear
ggplot(cincinnati, aes(x = Tax, y = MedPrice)) +
geom_point() +
labs(title = "Tax vs Price",
x = "Tax",
y = "Price") +
theme_minimal()
#Result: General negative correlation, but a weird grouping at one specific high tax rate
ggplot(cincinnati, aes(x = PTRatio, y = MedPrice)) +
geom_point() +
labs(title = "PTRatio vs Price",
x = "PTRatio",
y = "Price") +
theme_minimal()
#Result: Loose linear negative
ggplot(cincinnati, aes(x = LStat, y = MedPrice)) +
geom_point() +
labs(title = "LStat vs Price",
x = "LStat",
y = "Price") +
theme_minimal()
#Result: Non-linear negative
library(dplyr); library(tidyr); library(knitr)
demographics <- cincinnati %>%
summarise_if(
is.numeric,
list(
Mean = ~mean(.x, na.rm = TRUE),
Median = ~median(.x, na.rm = TRUE),
SD = ~sd(.x, na.rm = TRUE),
Min = ~min(.x, na.rm = TRUE),
Max = ~max(.x, na.rm = TRUE)
)
) %>%
pivot_longer(
everything(),
names_to = c("Variable", "Statistic"),
names_sep = "_",
values_to = "Value"
) %>%
pivot_wider(
names_from = Statistic,
values_from = Value
)
kable(demographics, digits = 2,
caption = "Descriptive Statistics for Key Variables")
| Variable | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
| CrimeRate | 3.35 | 0.19 | 9.43 | 0.01 | 88.98 |
| Indus | 10.47 | 8.56 | 6.68 | 0.74 | 25.65 |
| River | 0.05 | 0.00 | 0.21 | 0.00 | 1.00 |
| AvgRoom | 6.34 | 6.32 | 0.64 | 4.88 | 8.70 |
| Age | 65.07 | 69.50 | 28.28 | 3.00 | 100.00 |
| Tax | 399.20 | 330.00 | 161.90 | 188.00 | 666.00 |
| PTRatio | 18.28 | 18.65 | 2.20 | 12.60 | 21.20 |
| LStat | 11.85 | 10.47 | 6.60 | 2.47 | 34.41 |
| MedPrice | 22.62 | 21.45 | 8.28 | 5.60 | 50.00 |
# count how many houses are riverfront vs. non-riverfront
river_counts <- cincinnati %>%
group_by(River) %>%
summarise(
count = n(),
pct = round(n() / nrow(cincinnati) * 100, 1)
)
river_counts
## # A tibble: 2 × 3
## River count pct
## <dbl> <int> <dbl>
## 1 0 143 95.3
## 2 1 7 4.7
Findings: Only 7 houses are riverfront, about 4.7% of the data. Further analysis will determine whether we can drop this variable.
# Checking to see if riverfront houses are outliers
# Compute IQR thresholds for MedPrice
Q1 <- quantile(cincinnati$MedPrice, 0.25)
Q3 <- quantile(cincinnati$MedPrice, 0.75)
IQR_val <- IQR(cincinnati$MedPrice)
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
cincinnati <- cincinnati %>%
mutate(is_outlier = MedPrice < lower | MedPrice > upper)
# See how many outliers are by river status
outlier_summary <- cincinnati %>%
filter(is_outlier) %>%
group_by(River) %>%
summarise(
n = n(),
pct = round(n()/nrow(cincinnati)*100, 1)
)
print(outlier_summary)
## # A tibble: 1 × 3
## River n pct
## <dbl> <int> <dbl>
## 1 0 11 7.3
ggplot(cincinnati, aes(
x = factor(River, labels = c("Non-river", "Riverfront")),
y = MedPrice
)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16) +
labs(
title = "Median Home Price by River Status",
x = "River Status",
y = "Median Price ($1000s)"
) +
theme_minimal()
All of the very low and very high price outliers come from non-river homes; riverfront homes all sit in the normal price range. So we cannot justify dropping the riverfront houses, even though they make up only a small share of the dataset.
#How does the crime rate affect housing prices across different price segments?
#create price quartiles
cincinnati$price_quartile <- cut(cincinnati$MedPrice,
breaks = quantile(cincinnati$MedPrice, probs = 0:4/4),
include.lowest = TRUE,
labels = c("Q1", "Q2", "Q3", "Q4"))
# calculate mean crime rate for each quartile
mean_crime_by_quartile <- cincinnati %>%
group_by(price_quartile) %>%
summarize(mean_crime_rate = mean(CrimeRate, na.rm = TRUE))
#create scatter plot with trend lines
ggplot(cincinnati, aes(x = MedPrice, y = CrimeRate)) +
geom_point(aes(color = price_quartile), alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = price_quartile)) +
labs(title = "Crime Rate vs Housing Prices by Price Quartile",
x = "Housing Price",
y = "Crime Rate") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: Crime rate appears to have the strongest impact on low-priced homes. The lowest price quartile (Q1) has the highest crime rates, and its trend line indicates a strong negative correlation between housing price and crime rate. Although there are some small trends in the other quartiles, they are not clear enough to draw meaningful conclusions. From this, it appears that high crime disproportionately impacts lower-priced housing more than any other price group.
# Investigating the Relationship between Housing Age and Room Number
ggplot(cincinnati, aes(x = Age, y = AvgRoom)) +
geom_point(color = "blue") +
labs(title = "Relationship between Housing Age and Average Room Numbers",
x = "Housing Age (Years)",
y = "Average Number of Rooms") +
theme_minimal()
# Calculate correlation coefficient
age_avgroom_cor <- cor(cincinnati$Age, cincinnati$AvgRoom)
print(paste("The Pearson correlation coefficient between Age and AvgRoom is:", round(age_avgroom_cor, 3)))
## [1] "The Pearson correlation coefficient between Age and AvgRoom is: -0.242"
cor.test(cincinnati$Age, cincinnati$AvgRoom)
##
## Pearson's product-moment correlation
##
## data: cincinnati$Age and cincinnati$AvgRoom
## t = -3.0386, df = 148, p-value = 0.002811
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.38753602 -0.08537876
## sample estimates:
## cor
## -0.2423246
Interpretation: The correlation between Age and AvgRoom is -0.2423, which indicates a mild negative linear relationship: as houses get older, they tend to have slightly fewer rooms on average. The p-value of 0.002811 < 0.05 means the result is statistically significant, and the confidence interval does not contain zero, supporting a true correlation different from zero. We can conclude that, in our dataset, as housing age increases, there is a slight tendency for the average number of rooms to decrease.
#Investigating: Is there a relationship between the student-teacher ratio (PTRatio) and the crime rate (CrimeRate) in the city?
ggplot(cincinnati, aes(x = PTRatio, y = CrimeRate)) +
geom_point(color = "blue") +
labs(title = "Relationship between PTRatio and CrimeRate",
x = "PTRatio",
y = "CrimeRate") +
theme_minimal()
#Calculate correlation coefficient
ptratio_crimerate_cor <- cor(cincinnati$PTRatio, cincinnati$CrimeRate)
print(paste("The Pearson correlation coefficient between PTRatio and CrimeRate is:", round(ptratio_crimerate_cor, 3)))
## [1] "The Pearson correlation coefficient between PTRatio and CrimeRate is: 0.275"
cor.test(cincinnati$PTRatio, cincinnati$CrimeRate)
##
## Pearson's product-moment correlation
##
## data: cincinnati$PTRatio and cincinnati$CrimeRate
## t = 3.4742, df = 148, p-value = 0.0006723
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1196018 0.4165308
## sample estimates:
## cor
## 0.2745999
Interpretation: The correlation between PTRatio and CrimeRate is 0.275, which suggests that as the ratio of students to teachers increases, so does the crime rate. This intuitively makes sense: less affluent cities have fewer teachers per student, and such areas are typically more prone to crime. The p-value of 0.00067 < 0.05 means the result is statistically significant, and the confidence interval does not contain zero, supporting a true correlation different from zero. We can conclude that, in our dataset, cities with higher pupil-teacher ratios tend to have higher crime rates.
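Since CrimeRate is heavily right-skewed (max 88.98 vs. a median of 0.19), a rank-based Spearman correlation is a reasonable robustness check alongside Pearson. A sketch on synthetic data (an assumption, not the cincinnati file) showing how a single extreme point can inflate Pearson while Spearman stays near zero:

```r
# Sketch: one matched outlier dominates Pearson but barely moves Spearman
set.seed(1)
x <- c(rnorm(49), 50)   # 49 ordinary values plus one large outlier
y <- c(rnorm(49), 50)   # otherwise unrelated, but shares the outlier
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
round(c(pearson = pearson, spearman = spearman), 2)
```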
#Investigating: How do socioeconomic factors (LStat variable) correlate with property tax rates?
# LStat - % lower status of the population by city
# Tax - full-value property-tax rate per $10,000
cincinnati <- read_csv("CINCINNATI.csv") #Reset data
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(cincinnati, aes(x = Tax, y = LStat)) +
geom_point() +
labs(
title = "Tax vs LStat"
)
# Attempting to explain strange grouping found in scatter plot
# grouping by river
ggplot(cincinnati, aes(x = Tax, y = LStat, color=as.factor(River))) +
geom_point() +
labs(
title = "Tax vs LStat by River",
color = "River"
)
# grouping by CrimeRate Category
cincinnati$CrimeRate <- cut(
cincinnati$CrimeRate,
breaks = c(0, 1, 5, Inf), # Define breakpoints for "low", "moderate", "high"
labels = c("low", "moderate", "high") # Labels for each category
)
ggplot(cincinnati, aes(x = Tax, y = LStat, color=as.factor(CrimeRate))) +
geom_point() +
labs(
title = "Tax vs LStat by Crime rate category",
color = "Crime rate category"
)
# grouping by Indus
cincinnati$Indus <- cut(
cincinnati$Indus,
breaks = c(0, 13.60556, Inf), # this cutoff is explained in the next block
labels = c("low", "high")
)
ggplot(cincinnati, aes(x = Tax, y = LStat, color=Indus)) +
geom_point() +
labs(
title = "Tax vs LStat by Indus",
color = "Indus"
)
Interpretation: There is a positive linear relationship between Tax and LStat; in general, as Tax increases, LStat also increases. However, there is a strange group at a Tax value of 666. Examining the group by coloring by different categorical variables:
- River doesn't seem to have any effect on the grouping; there aren't enough data points that bound the river (1) to draw conclusions.
- Crime rate categories of 'moderate' and 'high' correspond well with the grouping.
- Data points with high Indus levels all seem to be part of the strange grouping.
#Investigating: What is the relationship between industrial land use (Indus) and housing characteristics such as room sizes, property tax rates, and housing prices?
cincinnati <- read_csv("CINCINNATI.csv") #Reset data
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(cincinnati, aes(x=Indus))+
geom_density()+
labs(
title = "Indus Density Plot",
x = "Indus",
y = "Density"
)
# === Finding where to split the data ===
# Compute the density
density_values <- density(cincinnati$Indus)
# Find local minima
x <- density_values$x
y <- density_values$y
# Compare each y-value with its neighbors
local_minima <- which(diff(sign(diff(y))) == 2) + 1 # Indices of local minima
# Extract the x and y coordinates of the minima
minima_x <- x[local_minima]
minima_y <- y[local_minima]
# Print the local minima
data.frame(x = minima_x, y = minima_y)
## x y
## 1 13.60556 0.01937223
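The `diff(sign(diff(y))) == 2` trick used above can be sanity-checked on a curve whose minima are known analytically; a minimal sketch:

```r
# Sketch: cos(x) on [0, 4*pi] has interior minima at exactly pi and 3*pi
xs <- seq(0, 4 * pi, length.out = 1000)
ys <- cos(xs)
# sign of the first difference flips from -1 to +1 at a local minimum
mins <- which(diff(sign(diff(ys))) == 2) + 1
round(xs[mins], 2)   # close to pi (~3.14) and 3*pi (~9.42)
```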
ggplot(cincinnati, aes(x = Indus)) +
geom_density() +
geom_vline(xintercept = minima_x, color = "red", linetype = "dashed", size = 1) +
labs(
title = "Indus Density Plot Split",
x = "Indus",
y = "Density"
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# === Splitting Indus -> low + high ===
cincinnati$Indus <- cut(
cincinnati$Indus,
breaks = c(0, minima_x, Inf),
labels = c("low", "high")
)
ggpairs(cincinnati[, c("Indus", "AvgRoom", "Tax", "MedPrice")], aes(color=cincinnati$Indus, alpha=0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interpretation: There are roughly twice as many low-Indus observations as high-Indus ones. The Indus level doesn't have much effect on the average room number (AvgRoom). Median price (MedPrice) shows a slight shift in distribution between Indus levels: on average, a house in a high-Indus area has a lower MedPrice. Tax rate (Tax), however, varies widely between the low and high Indus levels; tax rates in high-Indus areas are, on average, much higher than those in low-Indus areas.
cincinnati <- read_csv("CINCINNATI.csv") #Reset data
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Investigate bimodality in Indus, convert to categorical variable
#There appears to be bimodality with Indus
dens <- density(cincinnati$Indus)
plot(dens, main = "Checking for Bimodality in Indus")
polygon(dens, col = "skyblue", border = "darkblue")
#Find local minima
local_mins <- which(diff(sign(diff(dens$y))) == 2) + 1
min_x_values <- dens$x[local_mins]
min_y_values <- dens$y[local_mins]
#Add vertical lines and points at the local minima
abline(v = min_x_values, col = "red", lty = 2)
points(min_x_values, min_y_values, col = "red", pch = 19)
#Print the local minimum x-values
min_x_values
## [1] 13.60556
#Convert Indus to a categorical variable.
cincinnati$IndusCat <- ifelse(cincinnati$Indus < min_x_values, "Low", "High")
cincinnati$IndusCat <- factor(cincinnati$IndusCat, levels = c("Low", "High"))
ggpairs(data = cincinnati, aes(color = IndusCat, alpha = 0.6))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Investigate weird spread of CrimeRate, convert to categorical variable
ggplot(cincinnati, aes(x=CrimeRate))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(cincinnati$CrimeRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01096 0.07250 0.19160 3.34698 2.69719 88.97620
#recoding crimerate
# breakpoints: (0, Q1, Q3, inf) -> low, medium, high
#cincinnati$CrimeRate <- cut(
# cincinnati$CrimeRate,
# breaks = c(0, 0.07250, 2.69719, Inf), # Define breakpoints for "low", "medium", "high"
# labels = c("low", "medium", "high") # Labels for each category
#)
#ggpairs(cincinnati, aes(color=cincinnati$CrimeRate, alpha=0.5))
# ==== alt crimerate breakpoints ====
#cincinnati <- read.csv("CINCINNATI.csv")
#cincinnati$River <- recode(cincinnati$River, `0`="otherwise", `1`="bounds river")
# using (0, Q1, Q3, inf) as breakpoints for low, medium, high didn't seem to represent it well
# using arbitrary (0, 1, 5, inf) seems like better breakpoints
cincinnati$CrimeRateCat <- cut(
cincinnati$CrimeRate,
breaks = c(0, 1, 5, Inf), # Define breakpoints for "low", "medium", "high"
labels = c("Low", "Medium", "High") # Labels for each category
)
cincinnati$CrimeRateCat <- factor(cincinnati$CrimeRateCat, levels = c("Low", "Medium", "High"))
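A minimal sketch of the `cut()` recoding used here, on a made-up vector of crime rates (an assumption for illustration) with the same 0/1/5 breakpoints:

```r
# Sketch: cut() with right-closed intervals (0,1], (1,5], (5,Inf]
rates <- c(0.03, 0.8, 1.2, 4.9, 15.3, 88.9)
cats <- cut(rates, breaks = c(0, 1, 5, Inf), labels = c("Low", "Medium", "High"))
table(cats)   # two observations land in each category
```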
ggpairs(cincinnati, aes(color=cincinnati$CrimeRateCat, alpha=0.5))
## Warning in cor(x, y): the standard deviation is zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#create binary variables for categorical variable
cincinnati$IsLowCrime <- ifelse(cincinnati$CrimeRateCat == "Low", 1, 0)
cincinnati$IsMediumCrime <- ifelse(cincinnati$CrimeRateCat == "Medium", 1, 0)
cincinnati$IsHighCrime <- ifelse(cincinnati$CrimeRateCat == "High", 1, 0)
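As an aside, the three `ifelse()` dummies above are essentially what `model.matrix()` (and `lm()` internally) builds from a factor, minus the redundant baseline column; a small sketch on a toy factor:

```r
# Sketch: dummy coding from a factor; "Low" becomes the baseline level
f <- factor(c("Low", "Medium", "High", "Low"),
            levels = c("Low", "Medium", "High"))
mm <- model.matrix(~ f)   # columns: (Intercept), fMedium, fHigh
colnames(mm)
mm[, "fMedium"]           # indicator for the "Medium" level
```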
#CrimeRate
m <- lm(MedPrice ~ CrimeRateCat, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
#examine each category vs MedPrice
ggplot(cincinnati, aes(x = CrimeRate, y = MedPrice, color = CrimeRateCat)) +
geom_point(alpha = 0.7) +
theme_minimal() +
labs(title = "Crime vs Median Housing Price",
x = "Crime Rate",
y = "Median Housing Price",
color = "Crime Category")
Findings: Looks fine, nothing of note here. Confirms that higher crime rates generally correlate with lower housing prices.
#IndusCat
m <- lm(MedPrice ~ IndusCat, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
#examine each category vs MedPrice
ggplot(cincinnati, aes(x = Indus, y = MedPrice, color = IndusCat)) +
geom_point(alpha = 0.7) +
theme_minimal() +
labs(title = "Indus vs Median Housing Price",
x = "Indus",
y = "Median Housing Price",
color = "Indus Category")
Findings: Looks fine, nothing of note. Confirms that the high Indus category generally corresponds to moderately lower housing prices.
#River
m <- lm(MedPrice ~ River, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
Findings: Looks mostly ok, the normality looks slightly skewed but not
too alarming.
#AvgRoom
m <- lm(MedPrice ~ AvgRoom, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
Findings: Overall, it looks OK. The Residuals vs Fitted plot has a moderate funnel shape, indicating possible mild non-constant variance. The trend line on the Residuals vs Fitted plot isn't too bad, so leave it as is for now. Normality looks good.
#Age
m <- lm(MedPrice ~ Age, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
#What does log transform do?
m_log <- lm(log(MedPrice) ~ Age, data = cincinnati)
par(mfrow = c(2,2))
plot(m_log)
Findings: The residual plot looks a bit concerning: a tight grouping below the residual line, a spread-out grouping above it, and the right side of the Q-Q plot trends upward rather quickly. A log transform on MedPrice seems to fix normality and tighten up the variance, though there is still a reverse funnel shape, indicating more variance at lower prices. This makes sense, since lower-priced homes are more susceptible to factors like crime and age, which aren't in this model.
#Tax
m <- lm(MedPrice ~ Tax, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
Findings: There is a weird grouping of tax rates on the left side, corresponding to the high-Indus tax bracket, and a normal spread on the right, corresponding to the low-Indus cities. On both sides the residuals look evenly spread around the line, indicating roughly equal variance despite the groupings. Normality also looks good.
#PTRatio
m <- lm(MedPrice ~ PTRatio, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
Findings: Looks ok, nothing too noteworthy here.
#LStat
m <- lm(MedPrice ~ LStat, data = cincinnati)
par(mfrow = c(2,2))
plot(m)
m1 <- lm(MedPrice ~ I(LStat^2), data = cincinnati)
par(mfrow = c(2,2))
plot(m1)
m2 <- lm(MedPrice ~ I(1/LStat), data = cincinnati)
par(mfrow = c(2,2))
plot(m2)
m3 <- lm(MedPrice ~ log(LStat), data = cincinnati)
par(mfrow = c(2,2))
plot(m3)
Findings: The initial plot of LStat indicates non-linearity in the residual plot, and also in the Q-Q plot via a sharp deviation at the right end of the line. To test the non-linearity theory, I applied two different transformations of LStat:
1. LStat^2: a standard way to try to linearize a relationship. This did nothing to help the residual plot and seems only to have sharpened the trend.
2. 1/LStat: suggested by the shape of the earlier LStat vs MedPrice plot, which hinted at a 1/x relationship.
I also tried a log transform on LStat, which significantly improved the residual plot. The Q-Q plots for 1/LStat and log(LStat) look almost the same, but the residual plot is best for log(LStat), indicating that the log transformation substantially improves the variance.
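The transformation comparison can also be made quantitative with AIC. A sketch on synthetic data (an assumption, not the cincinnati file) where y is generated with a log-shaped relationship, standing in for MedPrice ~ LStat:

```r
# Sketch: compare candidate transformations of a predictor via AIC
set.seed(42)
x <- runif(150, 2, 35)                      # range similar to LStat
y <- 30 - 6 * log(x) + rnorm(150, sd = 1)   # true relationship is log-shaped
fits <- list(
  linear = lm(y ~ x),
  square = lm(y ~ I(x^2)),
  recip  = lm(y ~ I(1/x)),
  log    = lm(y ~ log(x))
)
sapply(fits, AIC)   # the log model should have the lowest (best) AIC here
```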
ggplot(cincinnati, aes(x = Tax, y = MedPrice, color = IndusCat)) +
geom_point(alpha = 0.7) +
theme_minimal() +
labs(title = "Tax vs MedPrice w/IndusCat",
x = "Tax",
y = "MedPrice",
color = "Indus Category")
ggplot(cincinnati, aes(x = Tax, y = MedPrice, color = IndusCat)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "black", linetype = "dashed") + # Overall trendline
theme_minimal() +
labs(title = "Tax vs MedPrice with Trendlines (by IndusCat and Overall)",
x = "Tax",
y = "MedPrice",
color = "Indus Category")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Findings: The special group (High Indus, Tax = 666) isn’t drastically
deviating in terms of the Tax relationship. The blue trendline, which
includes the cluster at Tax ≈ 666, doesn’t show a sharp bend or a
completely different slope at that high tax value compared to the rest
of the blue line. This reinforces the observation that they might
roughly follow a similar trend with respect to Tax. For now, I'll leave
it as is.
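One way to formalize this eyeball comparison would be to test a Tax-by-IndusCat interaction: a non-significant interaction coefficient would support the shared-slope reading. A sketch:

```r
# If the Tax:IndusCat interaction term is not significant, the two
# groups share roughly the same Tax slope
int_fit <- lm(MedPrice ~ Tax * IndusCat, data = cincinnati)
summary(int_fit)$coefficients
```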
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(cincinnati), replace=TRUE, prob=c(0.7,0.3))
train <- cincinnati[sample, ]
test <- cincinnati[!sample, ]
full_model1 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + log(LStat), data = train)
summary(full_model1)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom +
## Age + Tax + PTRatio + log(LStat), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5632 -2.4985 -0.2042 2.2954 12.4737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.057207 9.870928 2.235 0.0278 *
## CrimeRateCatMedium -0.490999 1.841160 -0.267 0.7903
## CrimeRateCatHigh -4.650214 2.404492 -1.934 0.0561 .
## IndusCatHigh 0.791996 1.744445 0.454 0.6508
## River -1.096145 1.947074 -0.563 0.5748
## AvgRoom 4.205653 1.026139 4.099 8.7e-05 ***
## Age 0.015824 0.022194 0.713 0.4776
## Tax -0.003583 0.006213 -0.577 0.5655
## PTRatio -0.598354 0.238513 -2.509 0.0138 *
## log(LStat) -6.148356 1.474797 -4.169 6.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.939 on 96 degrees of freedom
## Multiple R-squared: 0.7845, Adjusted R-squared: 0.7643
## F-statistic: 38.83 on 9 and 96 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model1)
ad.test(full_model1$residuals)
##
## Anderson-Darling normality test
##
## data: full_model1$residuals
## A = 0.36117, p-value = 0.4397
full_model2 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train)
summary(full_model2)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom +
## Age + Tax + PTRatio + I(1/LStat), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4032 -2.5199 0.0155 2.1413 13.2817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.805374 7.114944 0.394 0.69424
## CrimeRateCatMedium -0.705842 1.787028 -0.395 0.69373
## CrimeRateCatHigh -6.041264 2.315745 -2.609 0.01054 *
## IndusCatHigh 0.757633 1.691698 0.448 0.65527
## River -1.725336 1.893379 -0.911 0.36445
## AvgRoom 4.325973 0.910087 4.753 7.02e-06 ***
## Age 0.008490 0.019941 0.426 0.67124
## Tax -0.002642 0.006038 -0.438 0.66271
## PTRatio -0.633996 0.231168 -2.743 0.00727 **
## I(1/LStat) 45.385868 9.159374 4.955 3.10e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.82 on 96 degrees of freedom
## Multiple R-squared: 0.7973, Adjusted R-squared: 0.7783
## F-statistic: 41.96 on 9 and 96 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model2)
ad.test(full_model2$residuals)
##
## Anderson-Darling normality test
##
## data: full_model2$residuals
## A = 0.2587, p-value = 0.7095
Findings: Considering both the slightly better summary statistics and the improved linearity suggested by the residual plot, I(1/LStat) appears to be the better transformation of LStat for predicting MedPrice in this dataset. Additionally, the AD p-values for both models indicate that we don’t have strong evidence against normality.
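The two full models can also be compared head-to-head on information criteria; since they have the same number of parameters, this amounts to comparing fit (a sketch using the two fits above):

```r
# Lower AIC/BIC favors the I(1/LStat) specification if it fits better
AIC(full_model1, full_model2)
BIC(full_model1, full_model2)
```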
best_subset_model <- regsubsets(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train, nbest = 2)
best_subset_summary <- summary(best_subset_model)
#Add RSS, adjusted R², Cp, BIC, MSE, etc. to the same data frame
mse_values <- best_subset_summary$rss / (nrow(train) - 1:nrow(best_subset_summary$which))
mse_values <- round(mse_values, 2)
results <- data.frame(
modelId = 1:nrow(best_subset_summary$which),
best_subset_summary$outmat,
Rsq = best_subset_summary$rsq,
RSS = best_subset_summary$rss,
AdjR2 = best_subset_summary$adjr2,
Cp = best_subset_summary$cp,
BIC = best_subset_summary$bic,
MSE = mse_values
)
results$Rsq <- round(results$Rsq, 2)
results$RSS <- round(results$RSS, 2)
results$AdjR2 <- round(results$AdjR2, 2)
results$Cp <- round(results$Cp, 2)
results$BIC <- round(results$BIC, 2)
print(results)
## modelId CrimeRateCatMedium CrimeRateCatHigh IndusCatHigh River AvgRoom
## 1 ( 1 ) 1
## 1 ( 2 ) 2 *
## 2 ( 1 ) 3 *
## 2 ( 2 ) 4
## 3 ( 1 ) 5 * *
## 3 ( 2 ) 6 *
## 4 ( 1 ) 7 * *
## 4 ( 2 ) 8 * *
## 5 ( 1 ) 9 * * *
## 5 ( 2 ) 10 * *
## 6 ( 1 ) 11 * * *
## 6 ( 2 ) 12 * * *
## 7 ( 1 ) 13 * * *
## 7 ( 2 ) 14 * * * *
## 8 ( 1 ) 15 * * * *
## 8 ( 2 ) 16 * * * * *
## Age Tax PTRatio I.1.LStat. Rsq RSS AdjR2 Cp BIC MSE
## 1 ( 1 ) * 0.65 2430.32 0.65 64.52 -101.48 23.15
## 1 ( 2 ) 0.47 3643.46 0.47 147.65 -58.56 35.03
## 2 ( 1 ) * 0.70 2048.51 0.70 40.36 -114.94 19.89
## 2 ( 2 ) * * 0.70 2049.82 0.70 40.45 -114.87 20.10
## 3 ( 1 ) * 0.78 1544.77 0.77 7.85 -140.19 15.29
## 3 ( 2 ) * * 0.76 1676.03 0.75 16.84 -131.55 16.76
## 4 ( 1 ) * * 0.79 1424.74 0.79 1.62 -144.10 14.39
## 4 ( 2 ) * * 0.78 1528.43 0.77 8.73 -136.66 15.60
## 5 ( 1 ) * * 0.80 1411.26 0.79 2.70 -140.45 14.55
## 5 ( 2 ) * * * 0.79 1419.18 0.78 3.24 -139.85 14.78
## 6 ( 1 ) * * * 0.80 1408.20 0.78 4.49 -136.01 14.82
## 6 ( 2 ) * * * 0.80 1408.31 0.78 4.50 -136.00 14.98
## 7 ( 1 ) * * * * 0.80 1404.73 0.78 6.25 -131.61 15.10
## 7 ( 2 ) * * * 0.80 1405.12 0.78 6.28 -131.58 15.27
## 8 ( 1 ) * * * * 0.80 1403.35 0.78 8.16 -127.05 15.42
## 8 ( 2 ) * * * 0.80 1403.72 0.78 8.18 -127.02 15.60
p1<-ggplot(data = results, aes(x = as.factor(modelId), y = MSE))+
geom_point(color="black", size=2) +ylab("MSE") + xlab("ModelId")
p2<-ggplot(data = results, aes(x = as.factor(modelId), y = AdjR2)) +
geom_point(color="black", size=2) +ylab("Adjusted RSQ") + xlab("ModelId")+
geom_point(data = results[which.max(results$AdjR2), ], color="red",
size=3)
p3<-ggplot(data = results, aes(x = as.factor(modelId), y = Cp)) +
geom_point(color="black", size=2) +ylab("Cp") + xlab("ModelId")+
geom_point(data = results[which.min(results$Cp), ], color="red",
size=3)
p4<-ggplot(data = results, aes(x = as.factor(modelId), y = BIC)) +
geom_point(color="black", size=2) +ylab("BIC") + xlab("ModelId")+
geom_point(data = results[which.min(results$BIC), ], color="red",
size=3)
grid.arrange(p1,p2,p3,p4,nrow=2)
Findings: Model 7 is the clear winner. It is closely tied for the
lowest MSE, and has the best adjusted R², the lowest Cp, and the lowest
BIC.
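The winning rows can also be pulled out programmatically rather than read off the plots (a sketch against the results data frame above):

```r
# Which modelId achieves the best value of each criterion
results$modelId[which.max(results$AdjR2)]
results$modelId[which.min(results$Cp)]
results$modelId[which.min(results$BIC)]
results$modelId[which.min(results$MSE)]
```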
coef(best_subset_model, 7)
## (Intercept) CrimeRateCatHigh AvgRoom PTRatio
## 1.0133460 -5.8807897 4.4816233 -0.6078183
## I(1/LStat)
## 44.3289871
model7 <- lm(MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat), data=train)
summary(model7)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.081 -2.265 0.068 2.099 13.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2525 6.8234 0.184 0.85473
## CrimeRateCatMedium -0.6695 1.1804 -0.567 0.57186
## CrimeRateCatHigh -6.0447 1.1389 -5.308 6.70e-07 ***
## AvgRoom 4.4965 0.8471 5.308 6.69e-07 ***
## PTRatio -0.6133 0.2093 -2.930 0.00419 **
## I(1/LStat) 43.2888 7.9558 5.441 3.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared: 0.7946, Adjusted R-squared: 0.7843
## F-statistic: 77.36 on 5 and 100 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model7)
ad.test(model7$residuals)
##
## Anderson-Darling normality test
##
## data: model7$residuals
## A = 0.29576, p-value = 0.5886
anova(model7)
## Analysis of Variance Table
##
## Response: MedPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## CrimeRateCat 2 2233.48 1116.74 78.634 < 2.2e-16 ***
## AvgRoom 1 2624.84 2624.84 184.826 < 2.2e-16 ***
## PTRatio 1 214.14 214.14 15.078 0.0001853 ***
## I(1/LStat) 1 420.46 420.46 29.606 3.774e-07 ***
## Residuals 100 1420.17 14.20
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(model7)
## [1] 1420.171
vif(model7)
## GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 1.501108 2 1.106886
## AvgRoom 1.974105 1 1.405029
## PTRatio 1.423441 1 1.193081
## I(1/LStat) 2.430542 1 1.559020
Findings: The linear model fitted to the training data, predicting MedPrice using CrimeRateCat, AvgRoom, PTRatio, and I(1/LStat), demonstrates a strong overall fit, with a high Multiple R-squared of 0.7946 and a significant F-statistic (p < 2.2e-16). The ANOVA results indicate that all four predictors, CrimeRateCat, AvgRoom, PTRatio, and I(1/LStat), are statistically significant contributors to explaining variation in median housing prices.
Specifically, a higher average number of rooms and a lower percentage of lower-status population (reflected by a higher value of 1/LStat) are associated with higher median prices. A higher pupil-teacher ratio (PTRatio) is associated with lower median prices. The crime rate category also has a significant impact, with areas classified as “High” showing substantially lower median prices compared to the “Low” reference group.
The residual sum of squares for the model is 1420.17, and an Anderson-Darling test on the residuals suggests no strong evidence against normality (p = 0.5886). Finally, the generalized variance inflation factors are modest (GVIF^(1/(2*Df)) below 1.6 for every term), indicating that multicollinearity is not a serious concern in the model.
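As an illustration of how the fitted model would be used, here is a prediction for a single hypothetical city (all predictor values below are made up for the example):

```r
# Hypothetical city: low crime, 6.2 rooms on average, PTRatio 17, LStat 10%
new_city <- data.frame(
  CrimeRateCat = factor("Low", levels = c("Low", "Medium", "High")),
  AvgRoom = 6.2,
  PTRatio = 17,
  LStat = 10  # model7 applies the 1/LStat transform internally
)
predict(model7, newdata = new_city, interval = "prediction")
```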
null_model <- lm(MedPrice ~ 1, data = train)
upper_model <- formula(full_model2)
forward_model <- step(null_model, direction = "forward", scope = upper_model)
## Start: AIC=444.84
## MedPrice ~ 1
##
## Df Sum of Sq RSS AIC
## + I(1/LStat) 1 4482.8 2430.3 336.03
## + AvgRoom 1 3269.6 3643.5 378.95
## + PTRatio 1 2358.7 4554.4 402.60
## + Tax 1 2326.8 4586.3 403.34
## + CrimeRateCat 2 2233.5 4679.6 407.48
## + IndusCat 1 1668.4 5244.7 417.56
## + Age 1 1650.9 5262.1 417.91
## <none> 6913.1 444.84
## + River 1 1.9 6911.2 446.81
##
## Step: AIC=336.03
## MedPrice ~ I(1/LStat)
##
## Df Sum of Sq RSS AIC
## + PTRatio 1 380.50 2049.8 319.98
## + CrimeRateCat 2 382.33 2048.0 321.88
## + Tax 1 315.91 2114.4 323.27
## + AvgRoom 1 289.68 2140.6 324.57
## + IndusCat 1 112.04 2318.3 333.02
## <none> 2430.3 336.03
## + Age 1 5.48 2424.8 337.79
## + River 1 0.00 2430.3 338.03
##
## Step: AIC=319.98
## MedPrice ~ I(1/LStat) + PTRatio
##
## Df Sum of Sq RSS AIC
## + AvgRoom 1 220.334 1829.5 309.93
## + CrimeRateCat 2 229.500 1820.3 311.39
## + Tax 1 159.924 1889.9 313.37
## + IndusCat 1 78.503 1971.3 317.84
## <none> 2049.8 319.98
## + River 1 11.286 2038.5 321.39
## + Age 1 2.361 2047.5 321.86
##
## Step: AIC=309.93
## MedPrice ~ I(1/LStat) + PTRatio + AvgRoom
##
## Df Sum of Sq RSS AIC
## + CrimeRateCat 2 409.31 1420.2 287.08
## + Tax 1 280.97 1548.5 294.25
## + IndusCat 1 144.83 1684.7 303.18
## <none> 1829.5 309.93
## + Age 1 16.27 1813.2 310.98
## + River 1 5.18 1824.3 311.62
##
## Step: AIC=287.08
## MedPrice ~ I(1/LStat) + PTRatio + AvgRoom + CrimeRateCat
##
## Df Sum of Sq RSS AIC
## <none> 1420.2 287.08
## + River 1 10.0761 1410.1 288.32
## + Age 1 3.0894 1417.1 288.85
## + Tax 1 1.9217 1418.2 288.94
## + IndusCat 1 1.6338 1418.5 288.96
coef(forward_model)
## (Intercept) I(1/LStat) PTRatio AvgRoom
## 1.2525372 43.2887552 -0.6133069 4.4965043
## CrimeRateCatMedium CrimeRateCatHigh
## -0.6695093 -6.0447106
summary(forward_model)
##
## Call:
## lm(formula = MedPrice ~ I(1/LStat) + PTRatio + AvgRoom + CrimeRateCat,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.081 -2.265 0.068 2.099 13.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2525 6.8234 0.184 0.85473
## I(1/LStat) 43.2888 7.9558 5.441 3.77e-07 ***
## PTRatio -0.6133 0.2093 -2.930 0.00419 **
## AvgRoom 4.4965 0.8471 5.308 6.69e-07 ***
## CrimeRateCatMedium -0.6695 1.1804 -0.567 0.57186
## CrimeRateCatHigh -6.0447 1.1389 -5.308 6.70e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared: 0.7946, Adjusted R-squared: 0.7843
## F-statistic: 77.36 on 5 and 100 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(forward_model)
ad.test(forward_model$residuals)
##
## Anderson-Darling normality test
##
## data: forward_model$residuals
## A = 0.29576, p-value = 0.5886
anova(forward_model)
## Analysis of Variance Table
##
## Response: MedPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## I(1/LStat) 1 4482.8 4482.8 315.650 < 2.2e-16 ***
## PTRatio 1 380.5 380.5 26.793 1.172e-06 ***
## AvgRoom 1 220.3 220.3 15.515 0.0001516 ***
## CrimeRateCat 2 409.3 204.7 14.411 3.167e-06 ***
## Residuals 100 1420.2 14.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(forward_model)
## [1] 1420.171
vif(forward_model)
## GVIF Df GVIF^(1/(2*Df))
## I(1/LStat) 2.430542 1 1.559020
## PTRatio 1.423441 1 1.193081
## AvgRoom 1.974105 1 1.405029
## CrimeRateCat 1.501108 2 1.106886
Findings: Forward selection produced the same model as best subset selection (model7).
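This equivalence can be checked directly rather than by inspection (a sketch; the coefficient vectors should agree up to term ordering):

```r
# Same terms, possibly in a different order, so align by name first
all.equal(coef(forward_model)[names(coef(model7))], coef(model7))
```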
backward_model <- step(full_model2, direction = "backward")
## Start: AIC=293.64
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age +
## Tax + PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - Age 1 2.65 1403.7 291.85
## - Tax 1 2.79 1403.9 291.86
## - IndusCat 1 2.93 1404.0 291.87
## - River 1 12.12 1413.2 292.56
## <none> 1401.1 293.64
## - PTRatio 1 109.78 1510.8 299.64
## - CrimeRateCat 2 144.00 1545.1 300.01
## - AvgRoom 1 329.76 1730.8 314.05
## - I(1/LStat) 1 358.34 1759.4 315.79
##
## Step: AIC=291.84
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Tax +
## PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - Tax 1 4.22 1407.9 290.16
## - IndusCat 1 4.46 1408.2 290.18
## - River 1 10.89 1414.6 290.66
## <none> 1403.7 291.84
## - PTRatio 1 109.12 1512.8 297.78
## - CrimeRateCat 2 141.36 1545.1 298.02
## - AvgRoom 1 390.26 1794.0 315.85
## - I(1/LStat) 1 417.23 1821.0 317.43
##
## Step: AIC=290.16
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + PTRatio +
## I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - IndusCat 1 2.16 1410.1 288.32
## - River 1 10.60 1418.5 288.96
## <none> 1407.9 290.16
## - PTRatio 1 127.90 1535.8 297.38
## - CrimeRateCat 2 276.62 1684.6 305.18
## - AvgRoom 1 388.90 1796.8 314.02
## - I(1/LStat) 1 421.75 1829.7 315.94
##
## Step: AIC=288.33
## MedPrice ~ CrimeRateCat + River + AvgRoom + PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - River 1 10.08 1420.2 287.08
## <none> 1410.1 288.32
## - PTRatio 1 129.70 1539.8 295.65
## - CrimeRateCat 2 414.20 1824.3 311.62
## - AvgRoom 1 390.57 1800.7 312.24
## - I(1/LStat) 1 421.76 1831.8 314.06
##
## Step: AIC=287.08
## MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## <none> 1420.2 287.08
## - PTRatio 1 121.95 1542.1 293.81
## - CrimeRateCat 2 409.31 1829.5 309.93
## - AvgRoom 1 400.15 1820.3 311.39
## - I(1/LStat) 1 420.46 1840.6 312.57
coef(backward_model)
## (Intercept) CrimeRateCatMedium CrimeRateCatHigh AvgRoom
## 1.2525372 -0.6695093 -6.0447106 4.4965043
## PTRatio I(1/LStat)
## -0.6133069 43.2887552
summary(backward_model)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.081 -2.265 0.068 2.099 13.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2525 6.8234 0.184 0.85473
## CrimeRateCatMedium -0.6695 1.1804 -0.567 0.57186
## CrimeRateCatHigh -6.0447 1.1389 -5.308 6.70e-07 ***
## AvgRoom 4.4965 0.8471 5.308 6.69e-07 ***
## PTRatio -0.6133 0.2093 -2.930 0.00419 **
## I(1/LStat) 43.2888 7.9558 5.441 3.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared: 0.7946, Adjusted R-squared: 0.7843
## F-statistic: 77.36 on 5 and 100 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(backward_model)
ad.test(backward_model$residuals)
##
## Anderson-Darling normality test
##
## data: backward_model$residuals
## A = 0.29576, p-value = 0.5886
anova(backward_model)
## Analysis of Variance Table
##
## Response: MedPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## CrimeRateCat 2 2233.48 1116.74 78.634 < 2.2e-16 ***
## AvgRoom 1 2624.84 2624.84 184.826 < 2.2e-16 ***
## PTRatio 1 214.14 214.14 15.078 0.0001853 ***
## I(1/LStat) 1 420.46 420.46 29.606 3.774e-07 ***
## Residuals 100 1420.17 14.20
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(backward_model)
## [1] 1420.171
vif(backward_model)
## GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 1.501108 2 1.106886
## AvgRoom 1.974105 1 1.405029
## PTRatio 1.423441 1 1.193081
## I(1/LStat) 2.430542 1 1.559020
Findings: Backward elimination also produced exactly the same model as forward selection and model7.
Model Comparison & Analysis
#Comparing model7 with another model that includes the tax rate.
#Model 8 uses CrimeRateCat, AvgRoom, Tax, and 1/LStat (Adj. R²: 0.77, Cp: 8.73, BIC: -136.66, MSE: 15.60)
coef(best_subset_model, 8)
## (Intercept) CrimeRateCatHigh AvgRoom Tax
## -11.61317438 -5.41653946 4.95850474 -0.00472546
## I(1/LStat)
## 45.94558786
model8 <- lm(MedPrice ~ CrimeRateCat + AvgRoom + Tax + I(1/LStat), data=train)
summary(model8)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + Tax + I(1/LStat),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9249 -2.5300 0.1701 2.0159 13.3584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.436996 5.224027 -2.189 0.0309 *
## CrimeRateCatMedium 0.280562 1.471531 0.191 0.8492
## CrimeRateCatHigh -5.171072 2.166700 -2.387 0.0189 *
## AvgRoom 4.951500 0.862965 5.738 1.03e-07 ***
## Tax -0.005308 0.005497 -0.966 0.3366
## I(1/LStat) 46.125787 8.206263 5.621 1.72e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.909 on 100 degrees of freedom
## Multiple R-squared: 0.779, Adjusted R-squared: 0.7679
## F-statistic: 70.49 on 5 and 100 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model8)
ad.test(model8$residuals)
anova(model8)
## Analysis of Variance Table
##
## Response: MedPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## CrimeRateCat 2 2233.48 1116.74 73.0910 < 2.2e-16 ***
## AvgRoom 1 2624.84 2624.84 171.7969 < 2.2e-16 ***
## Tax 1 44.18 44.18 2.8916 0.09215 .
## I(1/LStat) 1 482.71 482.71 31.5934 1.723e-07 ***
## Residuals 100 1527.88 15.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(model8)
## [1] 1527.876
vif(model8)
## GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 5.044310 2 1.498651
## AvgRoom 1.904309 1 1.379967
## Tax 5.038647 1 2.244693
## I(1/LStat) 2.403677 1 1.550380
Findings: The linear regression model predicting MedPrice using CrimeRateCat, AvgRoom, Tax, and 1/LStat demonstrates a strong fit to the training data, with a Multiple R-squared of 0.779 and an Adjusted R-squared of 0.7679. The model is statistically significant overall, as indicated by a high F-statistic (70.49, p < 2.2e-16). Among the predictors, AvgRoom and 1/LStat are highly significant (p < 0.001), suggesting that homes with more rooms and a lower proportion of lower-status residents (captured by a higher value of 1/LStat) are associated with higher median housing prices.
Additionally, CrimeRateCatHigh is statistically significant (p = 0.0189), indicating that homes in high crime areas tend to have notably lower median prices compared to low crime areas. However, the Tax variable and CrimeRateCatMedium are not statistically significant (p > 0.3 and p > 0.8 respectively), suggesting weaker explanatory power in the current model context.
The residual standard error is 3.909 on 100 degrees of freedom, and the residuals appear symmetrically distributed, as visualized in the diagnostic plots. An Anderson-Darling test on the residuals does not indicate a violation of normality assumptions. Variance Inflation Factors (VIFs) are all within acceptable ranges, indicating no significant multicollinearity concerns. Overall, while this model is slightly weaker than Model 7, it still captures key predictors of housing prices with relatively strong explanatory power.
#Calculate MSPE with test data, compare
# Predict MedPrice using model 7 and model 8 on test data
pred_model7 <- predict(model7, newdata = test)
pred_model8 <- predict(model8, newdata = test)
# Mean Squared Prediction Error
mspe_model7 <- mean((test$MedPrice - pred_model7)^2)
mspe_model8 <- mean((test$MedPrice - pred_model8)^2)
# Print the results
mspe_model7
## [1] 15.2939
mspe_model8
## [1] 15.67446
Findings: Model 7 performs slightly better than Model 8 on the test set as it has a lower Mean Squared Prediction Error.
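A single 70/30 split can be sensitive to the seed, so a small k-fold cross-validation would give a steadier comparison. A sketch, assuming CrimeRateCat is still attached to cincinnati as above:

```r
# 5-fold cross-validated MSPE for each candidate formula
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cincinnati)))
cv_mspe <- function(formula) {
  fold_errs <- sapply(1:k, function(i) {
    fit <- lm(formula, data = cincinnati[folds != i, ])
    pred <- predict(fit, newdata = cincinnati[folds == i, ])
    mean((cincinnati$MedPrice[folds == i] - pred)^2)
  })
  mean(fold_errs)
}
cv_mspe(MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat))  # model 7
cv_mspe(MedPrice ~ CrimeRateCat + AvgRoom + Tax + I(1/LStat))      # model 8
```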
Experiment: building models with the Tax rate outlier (666) removed
#Drop the Tax = 666 rows from the data, then rebuild the train/test split
cincinnati <- read_csv("CINCINNATI.csv")
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cincinnati_clean <- cincinnati %>%
filter(Tax <= 600)
head(cincinnati_clean)
## # A tibble: 6 × 9
## CrimeRate Indus River AvgRoom Age Tax PTRatio LStat MedPrice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0273 7.07 0 7.18 61 242 17.8 4.03 34.7
## 2 0.254 9.9 0 5.70 78 304 18.4 11.5 16.2
## 3 0.0396 5.19 0 6.04 35 224 20.2 8.01 21.1
## 4 0.145 10.0 0 5.73 65 432 17.8 13.6 19.3
## 5 0.786 3.97 0 7.01 85 264 13 14.8 30.7
## 6 0.0110 2.25 0 6.45 32 300 15.3 8.23 22
#IndusCat
#There appears to be bimodality with Indus
dens <- density(cincinnati_clean$Indus)
plot(dens, main = "Checking for Bimodality in Indus")
polygon(dens, col = "skyblue", border = "darkblue")
#Find local minima
local_mins <- which(diff(sign(diff(dens$y))) == 2) + 1
min_x_values <- dens$x[local_mins]
min_y_values <- dens$y[local_mins]
#Add vertical lines and points at the local minima
abline(v = min_x_values, col = "red", lty = 2)
points(min_x_values, min_y_values, col = "red", pch = 19)
#Print the local minimum x-values
min_x_values
## [1] 16.37088
#Convert Indus to a categorical variable.
cincinnati_clean$IndusCat <- ifelse(cincinnati_clean$Indus < min_x_values, "Low", "High")
cincinnati_clean$IndusCat <- factor(cincinnati_clean$IndusCat, levels = c("Low", "High"))
#ggpairs(data = cincinnati_clean, aes(color = IndusCat, alpha = 0.6))
#CrimeRateCat
cincinnati_clean$CrimeRateCat <- cut(
cincinnati_clean$CrimeRate,
breaks = c(0, 1, 5, Inf), # Define breakpoints for "low", "medium", "high"
labels = c("Low", "Medium", "High") # Labels for each category
)
cincinnati_clean$CrimeRateCat <- factor(cincinnati_clean$CrimeRateCat, levels = c("Low", "Medium", "High"))
#ggpairs(cincinnati_clean, aes(color=cincinnati_clean$CrimeRateCat, alpha=0.5))
#create binary variables for categorical variable
cincinnati_clean$IsLowCrime <- ifelse(cincinnati_clean$CrimeRateCat == "Low", 1, 0)
cincinnati_clean$IsMediumCrime <- ifelse(cincinnati_clean$CrimeRateCat == "Medium", 1, 0)
cincinnati_clean$IsHighCrime <- ifelse(cincinnati_clean$CrimeRateCat == "High", 1, 0)
#Train and test split
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(cincinnati_clean), replace=TRUE, prob=c(0.7,0.3))
train_clean <- cincinnati_clean[sample, ]
test_clean <- cincinnati_clean[!sample, ]
full_model3 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + log(LStat), data = train_clean)
summary(full_model3)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom +
## Age + Tax + PTRatio + log(LStat), data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0524 -2.0463 -0.7214 2.1495 11.4016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.301608 11.694389 -1.308 0.19494
## CrimeRateCatMedium -1.210951 1.942831 -0.623 0.53509
## IndusCatHigh 1.289027 1.884267 0.684 0.49614
## River -0.581074 1.882440 -0.309 0.75847
## AvgRoom 9.160766 1.214329 7.544 1.17e-10 ***
## Age -0.029340 0.022859 -1.284 0.20349
## Tax -0.021631 0.006402 -3.379 0.00118 **
## PTRatio -0.516995 0.255485 -2.024 0.04678 *
## log(LStat) -0.439619 1.741898 -0.252 0.80148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.64 on 71 degrees of freedom
## Multiple R-squared: 0.8202, Adjusted R-squared: 0.7999
## F-statistic: 40.48 on 8 and 71 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model3)
ad.test(full_model3$residuals)
##
## Anderson-Darling normality test
##
## data: full_model3$residuals
## A = 0.52605, p-value = 0.1747
full_model4 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train_clean)
summary(full_model4)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom +
## Age + Tax + PTRatio + I(1/LStat), data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6036 -2.1352 -0.3135 2.1639 11.6091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.081151 8.547681 -1.179 0.24218
## CrimeRateCatMedium -1.419221 1.895672 -0.749 0.45653
## IndusCatHigh 1.285426 1.829310 0.703 0.48455
## River -0.961953 1.843444 -0.522 0.60342
## AvgRoom 7.658635 1.140155 6.717 3.86e-09 ***
## Age -0.011516 0.020463 -0.563 0.57538
## Tax -0.020728 0.006217 -3.334 0.00136 **
## PTRatio -0.550077 0.249258 -2.207 0.03056 *
## I(1/LStat) 20.064800 10.272092 1.953 0.05472 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.547 on 71 degrees of freedom
## Multiple R-squared: 0.8292, Adjusted R-squared: 0.81
## F-statistic: 43.09 on 8 and 71 DF, p-value: < 2.2e-16
plot(full_model4)
ad.test(full_model4$residuals)
##
## Anderson-Darling normality test
##
## data: full_model4$residuals
## A = 0.32316, p-value = 0.5207
#The second model (full_model4) performs better in this case as well. Now running best subset selection on full_model4.
best_subset_model_clean <- regsubsets(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train_clean, nbest = 2)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 1 linear dependencies found
## Reordering variables and trying again:
best_subset_summary_clean <- summary(best_subset_model_clean)
#Add RSS, adjusted R², Cp, BIC, MSE, etc. to the same data frame
mse_values <- best_subset_summary_clean$rss / (nrow(train_clean) - 1:nrow(best_subset_summary_clean$which))
mse_values <- round(mse_values, 2)
results_clean <- data.frame(
NumVar = 1:nrow(best_subset_summary_clean$which),
best_subset_summary_clean$outmat,
Rsq = best_subset_summary_clean$rsq,
RSS = best_subset_summary_clean$rss,
AdjR2 = best_subset_summary_clean$adjr2,
Cp = best_subset_summary_clean$cp,
BIC = best_subset_summary_clean$bic,
MSE = mse_values
)
results_clean$Rsq <- round(results_clean$Rsq, 2)
results_clean$RSS <- round(results_clean$RSS, 2)
results_clean$AdjR2 <- round(results_clean$AdjR2, 2)
results_clean$Cp <- round(results_clean$Cp, 2)
results_clean$BIC <- round(results_clean$BIC, 2)
print(results_clean)
## NumVar CrimeRateCatMedium CrimeRateCatHigh IndusCatHigh River AvgRoom
## 1 ( 1 ) 1 *
## 1 ( 2 ) 2
## 2 ( 1 ) 3 *
## 2 ( 2 ) 4 *
## 3 ( 1 ) 5 *
## 3 ( 2 ) 6 *
## 4 ( 1 ) 7 *
## 4 ( 2 ) 8 *
## 5 ( 1 ) 9 *
## 5 ( 2 ) 10 * *
## 6 ( 1 ) 11 * * *
## 6 ( 2 ) 12 * *
## 7 ( 1 ) 13 * * *
## 7 ( 2 ) 14 * * * *
## 8 ( 1 ) 15 * * * *
## 8 ( 2 ) 16 * * * *
## Age Tax PTRatio I.1.LStat. Rsq RSS AdjR2 Cp BIC MSE
## 1 ( 1 ) 0.75 1330.25 0.74 28.24 -100.77 12.67
## 1 ( 2 ) * 0.64 1888.69 0.63 72.00 -72.73 18.16
## 2 ( 1 ) * 0.79 1110.74 0.78 13.04 -110.82 10.78
## 2 ( 2 ) * 0.78 1149.95 0.77 16.11 -108.04 11.27
## 3 ( 1 ) * * 0.81 986.64 0.80 5.31 -115.91 9.77
## 3 ( 2 ) * * 0.81 1006.49 0.80 6.87 -114.32 10.06
## 4 ( 1 ) * * * 0.83 914.90 0.82 1.69 -117.57 9.24
## 4 ( 2 ) * * * 0.82 950.14 0.81 4.45 -114.54 9.70
## 5 ( 1 ) * * * * 0.83 906.23 0.82 3.01 -113.95 9.34
## 5 ( 2 ) * * * 0.83 906.39 0.82 3.02 -113.93 9.44
## 6 ( 1 ) * * * 0.83 901.98 0.81 4.68 -109.94 9.49
## 6 ( 2 ) * * * * 0.83 902.29 0.81 4.70 -109.91 9.60
## 7 ( 1 ) * * * * 0.83 896.75 0.81 6.27 -106.03 9.64
## 7 ( 2 ) * * * 0.83 897.31 0.81 6.31 -105.98 9.75
## 8 ( 1 ) * * * * 0.83 893.32 0.81 8.00 -101.95 9.82
## 8 ( 2 ) * * * * 0.83 896.75 0.81 8.27 -101.64 9.96
# four plots for comparison
p1<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = MSE))+
geom_point(color="black", size=2) +ylab("MSE") + xlab("No of variables")
p2<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = AdjR2)) +
geom_point(color="black", size=2) +ylab("Adjusted RSQ") + xlab("No of variables")+
geom_point(data = results_clean[which.max(results_clean$AdjR2), ], color="red",
size=3)
p3<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = Cp)) +
geom_point(color="black", size=2) +ylab("Cp") + xlab("No of variables")+
geom_point(data = results_clean[which.min(results_clean$Cp), ], color="red",
size=3)
p4<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = BIC)) +
geom_point(color="black", size=2) +ylab("BIC") + xlab("No of variables")+
geom_point(data = results_clean[which.min(results_clean$BIC), ], color="red",
size=3)
grid.arrange(p1,p2,p3,p4,nrow=2)
#Findings: Model 7 is the winner on all criteria, but its predictor set has changed:
#CrimeRateCat is dropped and Tax is added, with the other predictors unchanged (AvgRoom, PTRatio, 1/LStat)
coef(best_subset_model_clean, 7)
## (Intercept) Age PTRatio I(1/LStat)
## 26.69642489 0.02589852 -0.81653124 79.84524930
## CrimeRateCatHigh
## 0.00000000
model7_clean <- lm(MedPrice ~ AvgRoom + Tax + PTRatio + I(1/LStat), data=train_clean)
summary(model7_clean)
##
## Call:
## lm(formula = MedPrice ~ AvgRoom + Tax + PTRatio + I(1/LStat),
## data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8828 -1.9872 -0.0568 2.1489 11.6709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.439828 7.802814 -1.338 0.184953
## AvgRoom 7.498197 1.022491 7.333 2.19e-10 ***
## Tax -0.020854 0.005742 -3.632 0.000512 ***
## PTRatio -0.538243 0.221944 -2.425 0.017709 *
## I(1/LStat) 23.676669 8.353881 2.834 0.005899 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.493 on 75 degrees of freedom
## Multiple R-squared: 0.8251, Adjusted R-squared: 0.8158
## F-statistic: 88.45 on 4 and 75 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model7_clean)
ad.test(model7_clean$residuals)
##
## Anderson-Darling normality test
##
## data: model7_clean$residuals
## A = 0.20647, p-value = 0.8648
anova(model7_clean)
## Analysis of Variance Table
##
## Response: MedPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## AvgRoom 1 3900.4 3900.4 319.7439 < 2.2e-16 ***
## Tax 1 219.5 219.5 17.9948 6.255e-05 ***
## PTRatio 1 97.9 97.9 8.0215 0.005933 **
## I(1/LStat) 1 98.0 98.0 8.0328 0.005899 **
## Residuals 75 914.9 12.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(model7_clean)
## [1] 914.8952
vif(model7_clean)
## AvgRoom Tax PTRatio I(1/LStat)
## 2.767988 1.091269 1.177228 2.848722
#Findings: 82.5% of the variation is explained by these four predictors, better than the 78.5% before dropping the Tax outlier
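For completeness, model7_clean can also be checked out-of-sample on the held-out clean split, mirroring the earlier MSPE comparison (a sketch):

```r
# Mean squared prediction error on the cleaned test set
pred_clean <- predict(model7_clean, newdata = test_clean)
mean((test_clean$MedPrice - pred_clean)^2)
```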
#Linear Regression - Forward Selection
null_model_clean <- lm(MedPrice ~ 1, data = train_clean)
upper_model_clean <- formula(full_model4)
forward_model_clean <- step(null_model_clean, direction = "forward", scope = upper_model_clean)
## Start: AIC=336.42
## MedPrice ~ 1
##
## Df Sum of Sq RSS AIC
## + AvgRoom 1 3900.4 1330.2 228.89
## + I(1/LStat) 1 3342.0 1888.7 256.93
## + PTRatio 1 1057.4 4173.2 320.35
## + Tax 1 865.4 4365.3 323.95
## + Age 1 861.9 4368.8 324.02
## + CrimeRateCat 1 514.1 4716.6 330.15
## + IndusCat 1 450.6 4780.1 331.21
## <none> 5230.7 336.42
## + River 1 40.6 5190.1 337.80
##
## Step: AIC=228.89
## MedPrice ~ AvgRoom
##
## Df Sum of Sq RSS AIC
## + Tax 1 219.512 1110.7 216.46
## + I(1/LStat) 1 180.298 1150.0 219.24
## + Age 1 156.664 1173.6 220.86
## + PTRatio 1 107.848 1222.4 224.12
## + CrimeRateCat 1 60.860 1269.4 227.14
## + IndusCat 1 45.794 1284.5 228.09
## <none> 1330.2 228.89
## + River 1 0.650 1329.6 230.85
##
## Step: AIC=216.46
## MedPrice ~ AvgRoom + Tax
##
## Df Sum of Sq RSS AIC
## + I(1/LStat) 1 124.097 986.64 208.98
## + Age 1 104.242 1006.49 210.58
## + PTRatio 1 97.851 1012.88 211.08
## <none> 1110.74 216.46
## + CrimeRateCat 1 14.155 1096.58 217.43
## + IndusCat 1 2.870 1107.87 218.25
## + River 1 0.228 1110.51 218.44
##
## Step: AIC=208.98
## MedPrice ~ AvgRoom + Tax + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## + PTRatio 1 71.743 914.90 204.94
## <none> 986.64 208.98
## + Age 1 22.672 963.97 209.12
## + CrimeRateCat 1 4.792 981.85 210.59
## + IndusCat 1 0.187 986.45 210.97
## + River 1 0.050 986.59 210.98
##
## Step: AIC=204.94
## MedPrice ~ AvgRoom + Tax + I(1/LStat) + PTRatio
##
## Df Sum of Sq RSS AIC
## <none> 914.90 204.94
## + Age 1 8.6694 906.23 206.18
## + CrimeRateCat 1 8.5067 906.39 206.19
## + River 1 7.8304 907.06 206.25
## + IndusCat 1 0.0039 914.89 206.94
coef(forward_model_clean)
## (Intercept) AvgRoom Tax I(1/LStat) PTRatio
## -10.43982834 7.49819680 -0.02085357 23.67666901 -0.53824336
summary(forward_model_clean)
##
## Call:
## lm(formula = MedPrice ~ AvgRoom + Tax + I(1/LStat) + PTRatio,
## data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8828 -1.9872 -0.0568 2.1489 11.6709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.439828 7.802814 -1.338 0.184953
## AvgRoom 7.498197 1.022491 7.333 2.19e-10 ***
## Tax -0.020854 0.005742 -3.632 0.000512 ***
## I(1/LStat) 23.676669 8.353881 2.834 0.005899 **
## PTRatio -0.538243 0.221944 -2.425 0.017709 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.493 on 75 degrees of freedom
## Multiple R-squared: 0.8251, Adjusted R-squared: 0.8158
## F-statistic: 88.45 on 4 and 75 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(forward_model_clean)
ad.test(forward_model_clean$residuals)
##
## Anderson-Darling normality test
##
## data: forward_model_clean$residuals
## A = 0.20647, p-value = 0.8648
anova(forward_model_clean)
## Analysis of Variance Table
##
## Response: MedPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## AvgRoom 1 3900.4 3900.4 319.7439 < 2.2e-16 ***
## Tax 1 219.5 219.5 17.9948 6.255e-05 ***
## I(1/LStat) 1 124.1 124.1 10.1730 0.002081 **
## PTRatio 1 71.7 71.7 5.8813 0.017709 *
## Residuals 75 914.9 12.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(forward_model_clean)
## [1] 914.8952
vif(forward_model_clean)
## AvgRoom Tax I(1/LStat) PTRatio
## 2.767988 1.091269 2.848722 1.177228
#Backward Selection
backward_model_clean <- step(full_model2, direction = "backward")
## Start: AIC=293.64
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age +
## Tax + PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - Age 1 2.65 1403.7 291.85
## - Tax 1 2.79 1403.9 291.86
## - IndusCat 1 2.93 1404.0 291.87
## - River 1 12.12 1413.2 292.56
## <none> 1401.1 293.64
## - PTRatio 1 109.78 1510.8 299.64
## - CrimeRateCat 2 144.00 1545.1 300.01
## - AvgRoom 1 329.76 1730.8 314.05
## - I(1/LStat) 1 358.34 1759.4 315.79
##
## Step: AIC=291.84
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Tax +
## PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - Tax 1 4.22 1407.9 290.16
## - IndusCat 1 4.46 1408.2 290.18
## - River 1 10.89 1414.6 290.66
## <none> 1403.7 291.84
## - PTRatio 1 109.12 1512.8 297.78
## - CrimeRateCat 2 141.36 1545.1 298.02
## - AvgRoom 1 390.26 1794.0 315.85
## - I(1/LStat) 1 417.23 1821.0 317.43
##
## Step: AIC=290.16
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + PTRatio +
## I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - IndusCat 1 2.16 1410.1 288.32
## - River 1 10.60 1418.5 288.96
## <none> 1407.9 290.16
## - PTRatio 1 127.90 1535.8 297.38
## - CrimeRateCat 2 276.62 1684.6 305.18
## - AvgRoom 1 388.90 1796.8 314.02
## - I(1/LStat) 1 421.75 1829.7 315.94
##
## Step: AIC=288.33
## MedPrice ~ CrimeRateCat + River + AvgRoom + PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## - River 1 10.08 1420.2 287.08
## <none> 1410.1 288.32
## - PTRatio 1 129.70 1539.8 295.65
## - CrimeRateCat 2 414.20 1824.3 311.62
## - AvgRoom 1 390.57 1800.7 312.24
## - I(1/LStat) 1 421.76 1831.8 314.06
##
## Step: AIC=287.08
## MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat)
##
## Df Sum of Sq RSS AIC
## <none> 1420.2 287.08
## - PTRatio 1 121.95 1542.1 293.81
## - CrimeRateCat 2 409.31 1829.5 309.93
## - AvgRoom 1 400.15 1820.3 311.39
## - I(1/LStat) 1 420.46 1840.6 312.57
coef(backward_model_clean)
## (Intercept) CrimeRateCatMedium CrimeRateCatHigh AvgRoom
## 1.2525372 -0.6695093 -6.0447106 4.4965043
## PTRatio I(1/LStat)
## -0.6133069 43.2887552
summary(backward_model_clean)
##
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.081 -2.265 0.068 2.099 13.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2525 6.8234 0.184 0.85473
## CrimeRateCatMedium -0.6695 1.1804 -0.567 0.57186
## CrimeRateCatHigh -6.0447 1.1389 -5.308 6.70e-07 ***
## AvgRoom 4.4965 0.8471 5.308 6.69e-07 ***
## PTRatio -0.6133 0.2093 -2.930 0.00419 **
## I(1/LStat) 43.2888 7.9558 5.441 3.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared: 0.7946, Adjusted R-squared: 0.7843
## F-statistic: 77.36 on 5 and 100 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(backward_model_clean)
ad.test(backward_model_clean$residuals)
##
## Anderson-Darling normality test
##
## data: backward_model_clean$residuals
## A = 0.29576, p-value = 0.5886
anova(backward_model_clean)
## Analysis of Variance Table
##
## Response: MedPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## CrimeRateCat 2 2233.48 1116.74 78.634 < 2.2e-16 ***
## AvgRoom 1 2624.84 2624.84 184.826 < 2.2e-16 ***
## PTRatio 1 214.14 214.14 15.078 0.0001853 ***
## I(1/LStat) 1 420.46 420.46 29.606 3.774e-07 ***
## Residuals 100 1420.17 14.20
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(backward_model_clean)
## [1] 1420.171
vif(backward_model_clean)
## GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 1.501108 2 1.106886
## AvgRoom 1.974105 1 1.405029
## PTRatio 1.423441 1 1.193081
## I(1/LStat) 2.430542 1 1.559020
#Findings: Forward selection on the cleaned data reproduced model 7 from best subsets exactly. Backward selection, which started from full_model2 (fit on the original train set), instead kept CrimeRateCat in place of Tax.
When we dropped the observations with Tax = 666 and repeated the model-building process, the procedure again picked a four-predictor model 7, but its predictors changed: CrimeRateCat is dropped and Tax is added, while AvgRoom, PTRatio, and 1/LStat stay the same. The model with the Tax outliers removed is slightly better: its adjusted R^2 is about 3 percentage points higher.
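A quick way to confirm that two search strategies landed on the same model is to compare the selected terms directly, using the objects fit above:

```r
# Same predictor set (ignoring order) implies the same model
identical(sort(names(coef(forward_model_clean))),
          sort(names(coef(model7_clean))))
```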
#Notes 4/16/2025: Start with a focus on demography. Figure out how many of the houses are on the river. If there are only a few, say 6 or so, they may be outliers that can be dropped. We have only 7 houses on the river; these might distort our analysis. Make a boxplot to check for outliers. If there are any, check whether those prices belong to riverfront houses: if so, drop them; if not, keep them in.
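The riverfront check described above takes only a few lines; a sketch assuming `cincinnati` is the data frame loaded at the top of the document:

```r
# How many houses bound the river? (River: 1 = riverfront, 0 = otherwise)
table(cincinnati$River)

# Boxplot of price by riverfront status to eyeball outliers
boxplot(MedPrice ~ River, data = cincinnati,
        names = c("not riverfront", "riverfront"),
        ylab = "Median price ($1000s)")

# List the riverfront houses, highest-priced first, to see whether the
# boxplot outliers are the riverfront ones
cincinnati %>%
  filter(River == 1) %>%
  arrange(desc(MedPrice))
```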
Next, check Age. We have a range from 3 to 100. Google what sensible cutoffs are and break it into categories.
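A sketch of the categorization step; the break points below are placeholders to be replaced once sensible housing-age bands are settled on:

```r
# Bin Age (range 3-100) into rough categories; breaks are placeholder values
cincinnati <- cincinnati %>%
  mutate(AgeCat = cut(Age,
                      breaks = c(0, 35, 70, 100),
                      labels = c("New", "Mid", "Old"),
                      include.lowest = TRUE))
table(cincinnati$AgeCat)
```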
Plot only y = B0 + B1x1. Make a series of residual plots to see whether the model looks good. If several show non-constant variance, transform the response using ln(y), then check whether ln(y) = B0 + B1x1 fits better. If we see a curve indicating non-linearity, add an x^2 term (or similar) to give y = B0 + B1x1 + B2x1^2. The VIF between x1 and x1^2 will be huge, but that doesn't matter for polynomial terms.
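The loop described above, sketched for a single predictor (AvgRoom is used here only as an example):

```r
# Step 1: simple model y = b0 + b1*x1, then residual diagnostics
fit1 <- lm(MedPrice ~ AvgRoom, data = cincinnati)
par(mfrow = c(2, 2))
plot(fit1)

# Step 2: if the variance looks non-constant, try a log response
fit_log <- lm(log(MedPrice) ~ AvgRoom, data = cincinnati)

# Step 3: if the residuals curve, add a quadratic term; the AvgRoom vs
# AvgRoom^2 VIF will be huge, but that is expected for polynomial terms
fit_quad <- lm(MedPrice ~ AvgRoom + I(AvgRoom^2), data = cincinnati)
```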
Every time we make a change, we need to refit, check the residuals, and see whether the fit improves.
Once we decide on all the transformations, we start building: not just best subsets, but also forward and backward selection. Check whether the models they pick appear among the best subsets.
To check interactions with a categorical variable, make a separate chart for each level of that variable and see whether the relationship with the response changes across levels. In our case, plot x1 and x2 vs. MedPrice separately for riverfront and non-riverfront houses.
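A sketch of that per-level check using ggplot2 faceting; if the fitted slopes differ visibly between the panels, a River interaction is worth testing formally:

```r
# One panel per River level; clearly different slopes suggest an interaction
ggplot(cincinnati, aes(x = AvgRoom, y = MedPrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ River, labeller = label_both)
```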
It is very hard to spot interactions between quantitative variables. We can categorize one of them and make plots to check.
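A sketch of that idea: bin one quantitative predictor into groups (Tax is used here only as an example) and look for non-parallel fitted lines:

```r
# Tertiles of Tax; non-parallel lines hint at a Tax-by-AvgRoom interaction
cincinnati %>%
  mutate(TaxCat = factor(ntile(Tax, 3),
                         labels = c("low", "mid", "high"))) %>%
  ggplot(aes(x = AvgRoom, y = MedPrice, colour = TaxCat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```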